1 Introduction

This paper is concerned with the effective parallel symbolic computation of operators under composition. Examples include differential operators under composition and vector fields under the Lie bracket. In general, such operators do not commute. An important problem is to find efficient algorithms to write expressions involving noncommuting operators in terms of operators which do commute. If the original expression enjoys a certain symmetry, then naive rewriting requires the computation of terms which in the end cancel. In [8], we gave an algorithm which in some cases is exponentially faster than the naive expansion of the noncommuting operators. The purpose of this paper is to show how that algorithm can be naturally parallelized.

In Section 2, we give a careful statement of the problem. In Section 3, we discuss data structures consisting of formal linear combinations of rooted labeled trees. We define a multiplication on rooted labeled trees, thereby making the set of these data structures into an associative algebra. We then define an algebra homomorphism from the original algebra of operators into this algebra of trees. In Section 4, we describe an algebra homomorphism from the algebra of trees into the algebra of differential operators. The cancellation which occurs when noncommuting operators are expressed in terms of commuting ones occurs naturally when the operators are represented using this data structure. This leads to an algorithm which, for operators which are derivations, speeds up the computation exponentially in the degree of the operator. This is described in Section 5. Sections 3-5 follow the treatment of [8]. In Section 6, we show how the algebra of trees leads naturally to a parallel version of the algorithm.

Here is a concrete example of the type of computations we are concerned with. Fix three vector fields $E_1, E_2, E_3$ in $\mathbf{R}^N$ with polynomial coefficients $a_j^\mu$:

$$E_j = \sum_{\mu=1}^{N} a_j^\mu \frac{\partial}{\partial x_\mu}, \qquad j = 1, 2, 3.$$

Considering the vector fields as first-order differential operators, it is natural to form higher-order differential operators from them, such as the third-order differential operator

$$p = E_3 E_2 E_1 - E_3 E_1 E_2 - E_2 E_1 E_3 + E_1 E_2 E_3.$$

Writing this differential operator in terms of the $\partial/\partial x_1, \ldots, \partial/\partial x_N$ yields a first-order differential operator, because the symmetry of the expression $p$ causes all second- and third-order terms to cancel.
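The same cancellation can already be seen one degree lower. The following worked equation is added for illustration only; it assumes two vector fields written as $E_j = \sum_{\mu} a_j^\mu \, \partial/\partial x_\mu$ and uses only the product rule and the fact that partial derivatives commute.

$$E_2 E_1 = \sum_{\mu,\nu=1}^{N} a_2^\nu \frac{\partial a_1^\mu}{\partial x_\nu} \frac{\partial}{\partial x_\mu} + \sum_{\mu,\nu=1}^{N} a_2^\nu a_1^\mu \frac{\partial^2}{\partial x_\nu \, \partial x_\mu}$$

$$E_2 E_1 - E_1 E_2 = \sum_{\mu,\nu=1}^{N} \left( a_2^\nu \frac{\partial a_1^\mu}{\partial x_\nu} - a_1^\nu \frac{\partial a_2^\mu}{\partial x_\nu} \right) \frac{\partial}{\partial x_\mu}$$

Expanding both products naively produces $4N^2$ terms, of which the $2N^2$ second-order terms cancel in pairs; only the $2N^2$ first-order terms survive.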

In this paper we analyse an algorithm for expressing differential operators $p$ in terms of the commuting derivations $\partial/\partial x_1, \ldots, \partial/\partial x_N$ in such a way that second- and third-order terms which cancel are not computed. In the example above, the naive computation would require the computation of $24N^3$ terms, while the algorithm we describe here would involve just the computation of the $6N^3$ terms which do not cancel.

We conclude this introduction with some re- 
marks. 

1. In actual applications expressions possessing
symmetry arise more often than not. For ex- 
ample, Lie brackets of vector fields possess 
a great deal of symmetry. The algorithm 
we discuss is designed to take advantage of 
such symmetries, if they are present, with- 
out the necessity of explicitly identifying the 
symmetries. 

2. Once a set of data structures has been given 
an algebraic structure, it becomes natural 
to view algorithms concerned with simplifi- 
cation as simply the factoring of a map into 
the composition of a map into the algebra of 
these data structures, and a map from this 
algebra. This is the simple idea which is at 
the basis of the algorithm we describe. We 
expect that this idea will find application 
elsewhere. 

3. See [4] and [3] for previous work on the sim- 
plification of expressions. See [9] and the 
references contained there for previous work 
on parallel symbolic computation. 

2 Higher-order derivations 

In this section we give a careful statement of the problem and state the main result. Let $R$ be a commutative algebra with unit over the field $k$. (Throughout this paper $k$ is a field of characteristic 0.) A derivation of the algebra $R$ is a linear map $D$ of $R$ to itself satisfying

$$D(ab) = aD(b) + bD(a), \quad \text{for all } a, b \in R.$$

Let $D_1, \ldots, D_N$ be $N$ commuting derivations of $R$, that is, for $i, j = 1, \ldots, N$,

$$D_i D_j a = D_j D_i a, \quad \text{for all } a \in R.$$

Suppose that we are also given $M$ derivations $E_1, \ldots, E_M$ of $R$ which can be expressed as $R$-linear combinations of the derivations $D_i$; that is, for $j = 1, \ldots, M$,

$$E_j = \sum_{\mu=1}^{N} a_j^\mu D_\mu, \quad \text{where } a_j^\mu \in R. \qquad (1)$$

We are interested in writing higher-order derivations generated by the $E_1, \ldots, E_M$ in terms of the commuting derivations $D_1, \ldots, D_N$. More formally, let $k\langle E_1, \ldots, E_M \rangle$ denote the free associative algebra in the symbols $E_1, \ldots, E_M$ and let $\mathrm{Diff}(D_1, \ldots, D_N; R)$ denote the space of formal linear differential operators with coefficients from $R$; that is, $\mathrm{Diff}(D_1, \ldots, D_N; R)$ consists of all finite formal expressions

$$\sum_{\mu_1=1}^{N} a_{\mu_1} D_{\mu_1} + \sum_{\mu_1,\mu_2=1}^{N} a_{\mu_1 \mu_2} D_{\mu_2} D_{\mu_1} + \cdots$$

where $a_{\mu_1}, a_{\mu_1 \mu_2}, \ldots \in R$. We let

$$\chi : k\langle E_1, \ldots, E_M \rangle \to \mathrm{Diff}(D_1, \ldots, D_N; R)$$

denote the map which sends $p \in k\langle E_1, \ldots, E_M \rangle$ to the linear differential operator $\chi(p)$ obtained by performing the substitution (1) and simplifying using the fact that the $D_\mu$ are derivations of $R$.

Suppose $p \in k\langle E_1, \ldots, E_M \rangle$ is of the form

$$p = \sum_{i=1}^{l} p_i,$$

where each term $p_i$ is of degree $m$. The naive computation of $\chi(p)$ would compute $\chi(p_i)$, for $i = 1, \ldots, l$. This would yield $l\,m!\,N^m$ terms. Assume $\mathrm{Cost}_A(p)$, the cost of applying algorithm $A$ to simplify $p \in k\langle E_1, \ldots, E_M \rangle$, is proportional to the number of differentiations and multiplications. Then

$$\mathrm{Cost}_{\mathrm{NAIVE}}(p) = O(l\,m\,m!\,N^m).$$

In Section 5, we describe an algorithm which pre- 
processes an expression p in such a way that any 
terms which cancel after the substitution (1) are 
not computed. We show: 

Theorem 1 Assume that

(i) $p$ is the sum of $l = 2^{m-1}$ terms, each homogeneous of degree $m$;

(ii) $L = \chi(p)$ is a linear differential operator of degree 1;

(iii) $m, N \to \infty$ in such a way that $2^{m-2} m \ll N^m$.

Then

$$\frac{\mathrm{Cost}_{\mathrm{BETTER}}(p)}{\mathrm{Cost}_{\mathrm{NAIVE}}(p)} = O\!\left(\frac{1}{\cdots}\right)$$

In Section 6, we describe how this precomputa- 
tion can be done as a parallel computation. 

Observe that a Lie bracket of degree $m$ on $\mathbf{R}^N$, for large enough $N$, satisfies the hypotheses of the theorem.

3 Trees and differential opera- 
tors 

In this section we describe the connection be- 
tween algebras and trees which is essential for 
the description of the data structures which we 
use in the next section, and for the analysis of 
the algorithms which use those data structures. 

By a tree we mean a rooted finite tree [10]. If $\{E_1, \ldots, E_M\}$ is a set of symbols, we will say a tree is labeled with $\{E_1, \ldots, E_M\}$ if every node of the tree other than the root has an element of $\{E_1, \ldots, E_M\}$ assigned to it. We denote the set of all trees labeled with $\{E_1, \ldots, E_M\}$ by $\mathcal{LT}(E_1, \ldots, E_M)$. Let $k\{\mathcal{LT}(E_1, \ldots, E_M)\}$ denote the vector space over $k$ with basis $\mathcal{LT}(E_1, \ldots, E_M)$. We show that this vector space is a graded connected algebra.

We define the multiplication in $k\{\mathcal{LT}(E_1, \ldots, E_M)\}$ as follows. Since the set of labeled trees forms a basis for $k\{\mathcal{LT}(E_1, \ldots, E_M)\}$, it is sufficient to describe the product of two labeled trees. Suppose $t_1$ and $t_2$ are two labeled trees. Let $s_1, \ldots, s_r$ be the children of the root of $t_1$. If $t_2$ has $n + 1$ nodes (counting the root), there are $(n + 1)^r$ ways to attach the $r$ subtrees of $t_1$ which have $s_1, \ldots, s_r$ as roots to the labeled tree $t_2$ by making each $s_i$ the child of some node of $t_2$, keeping the original labels. The product $t_1 t_2$ is defined to be the sum of these $(n + 1)^r$ labeled trees. It can be shown that this product is associative, and that the tree consisting only of the root is a multiplicative identity; see [5].
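The product just described can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation of [5] or [8]: the nested-tuple tree representation and the helper names `node`, `paths`, `attach` and `multiply` are assumptions made here. A tree is stored in a canonical form (children sorted), so that equal labeled trees compare equal, and a linear combination of trees is a `Counter` from trees to integer coefficients.

```python
from collections import Counter
from itertools import product


def node(label, *children):
    """A node in canonical form: (label, sorted tuple of child nodes).
    The root is the node whose label is None."""
    return (label, tuple(sorted(children)))


def paths(tree, prefix=()):
    """Yield the path (tuple of child indices) of every node of tree."""
    yield prefix
    for i, child in enumerate(tree[1]):
        yield from paths(child, prefix + (i,))


def attach(tree, extra, prefix=()):
    """Rebuild tree, hanging the subtrees extra[path] below the node at path."""
    label, children = tree
    rebuilt = [attach(c, extra, prefix + (i,)) for i, c in enumerate(children)]
    return node(label, *rebuilt, *extra.get(prefix, ()))


def multiply(t1, t2):
    """Product t1 * t2: attach each child subtree of the root of t1 to some
    node of t2, in all (n + 1)^r ways, and sum the resulting trees."""
    subtrees = t1[1]              # s_1, ..., s_r with their subtrees
    targets = list(paths(t2))     # the n + 1 nodes of t2
    result = Counter()
    for choice in product(targets, repeat=len(subtrees)):
        extra = {}
        for subtree, path in zip(subtrees, choice):
            extra.setdefault(path, []).append(subtree)
        result[attach(t2, extra)] += 1
    return result


# Two-node trees carrying the labels E1 and E2.
e1 = node(None, node("E1"))
e2 = node(None, node("E2"))
two_trees = multiply(e2, e1)   # E2 below the root, or E2 below E1
```

Here `multiply(e2, e1)` contains two trees, since the single child of the root of `e2` can be attached to either of the two nodes of `e1`, and `multiply(node(None), t)` returns `t` itself, in line with the root-only tree being the identity.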

We can define a grading on $k\{\mathcal{LT}(E_1, \ldots, E_M)\}$ by letting $k\{\mathcal{LT}_n(E_1, \ldots, E_M)\}$ be the subspace of $k\{\mathcal{LT}(E_1, \ldots, E_M)\}$ spanned by the trees with $n + 1$ nodes. The following theorem is proved in [6].

Theorem 2 $k\{\mathcal{LT}(E_1, \ldots, E_M)\}$ is a graded connected algebra.

If $\{E_1, \ldots, E_M\}$ is a set of symbols, then the free associative algebra $k\langle E_1, \ldots, E_M \rangle$ is a graded connected algebra, and there is an algebra homomorphism

$$\phi : k\langle E_1, \ldots, E_M \rangle \to k\{\mathcal{LT}(E_1, \ldots, E_M)\}.$$

The map $\phi$ sends $E_i$ to the labeled tree with two nodes: the root, and a child of the root labeled with $E_i$; it is then extended to all of $k\langle E_1, \ldots, E_M \rangle$ by using the fact that it is an algebra homomorphism.
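As an illustration of this map, the sketch below computes $\phi$ on polynomials in Python. It relies on one observation that follows from the definition of the product: multiplying on the left by the two-node tree labeled $E$ attaches a new leaf labeled $E$ to each node in turn, so $\phi(E_{\gamma_m} \cdots E_{\gamma_1})$ can be built from right to left. The nested-tuple representation and the names `leaf_attachments`, `phi_monomial` and `phi` are assumptions made for this example, not notation from the paper.

```python
from collections import Counter


def leaf_attachments(tree, label):
    """Yield each tree obtained by adding one new leaf `label` below a node.
    A tree is (label, sorted tuple of children); the root has label None."""
    lab, children = tree
    yield (lab, tuple(sorted(children + ((label, ()),))))
    for i, child in enumerate(children):
        rest = children[:i] + children[i + 1:]
        for new_child in leaf_attachments(child, label):
            yield (lab, tuple(sorted(rest + (new_child,))))


def phi_monomial(labels):
    """phi of the monomial E_{g_m} ... E_{g_1}, given as (g_m, ..., g_1)."""
    result = Counter({(None, ()): 1})   # the root-only tree is the identity
    for label in reversed(labels):      # rightmost factor is placed first
        nxt = Counter()
        for tree, coeff in result.items():
            for bigger in leaf_attachments(tree, label):
                nxt[bigger] += coeff
        result = nxt
    return result


def phi(poly):
    """Extend phi linearly; poly maps monomials (tuples of labels) to
    coefficients. Trees whose coefficients cancel are dropped."""
    total = Counter()
    for monomial, coeff in poly.items():
        for tree, count in phi_monomial(monomial).items():
            total[tree] += coeff * count
    return {tree: c for tree, c in total.items() if c != 0}


# The commutator E2 E1 - E1 E2.
sigma = phi({("E2", "E1"): 1, ("E1", "E2"): -1})
```

For the commutator, the tree in which $E_1$ and $E_2$ are both children of the root appears once with each sign and drops out, so `sigma` contains only the two chains. This is the cancellation that the later sections exploit.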

We say that a rooted finite tree is heap-ordered in case there is a total ordering on all nodes in the tree such that each node precedes all of its children in the ordering. We say such a tree is labeled with $\{E_1, \ldots, E_M\}$ in case every node, except the root, has an element of $\{E_1, \ldots, E_M\}$ assigned to it. Let $k\{\mathcal{LHOT}(E_1, \ldots, E_M)\}$ denote the vector space over $k$ whose basis consists of labeled heap-ordered trees. It turns out that $k\{\mathcal{LHOT}(E_1, \ldots, E_M)\}$ is also a graded connected algebra using the same multiplication defined above. See [6] for a proof of the following theorem.

Theorem 3 The map

$$\phi : k\langle E_1, \ldots, E_M \rangle \to k\{\mathcal{LHOT}(E_1, \ldots, E_M)\}$$

is injective.

4 Simplification of higher or- 
der derivations 

In this section we define a map

$$\psi : k\{\mathcal{LT}(E_1, \ldots, E_M)\} \to \mathrm{Diff}(D_1, \ldots, D_N; R).$$

We do this in several steps.



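Step 1. Given a labeled tree $t \in \mathcal{LT}_m(E_1, \ldots, E_M)$, assign the root the number 0 and assign the remaining nodes the numbers $1, \ldots, m$. From now on we identify the node with the number assigned to it. Let $k \in \mathrm{nodes}\ t$, and suppose that $l, \ldots, l'$ are the children of $k$ and that, for $k > 0$, the node $k$ is labeled with $E_{\gamma_k}$. Fix $\mu_1, \ldots, \mu_m$ with

$$1 \le \mu_1, \ldots, \mu_m \le N$$

and define

$$R(k; \mu_1, \ldots, \mu_m) = \begin{cases} D_{\mu_l} \cdots D_{\mu_{l'}}\, a_{\gamma_k}^{\mu_k}, & \text{if } k \text{ is not the root,} \\ D_{\mu_l} \cdots D_{\mu_{l'}}, & \text{if } k \text{ is the root.} \end{cases}$$

We abbreviate this to $R_t(k)$ or $R(k)$. Observe that $R_t(k) \in R$ for $k > 0$.

Step 2. Define

$$\psi(t) = \sum_{\mu_1, \ldots, \mu_m = 1}^{N} R(m) \cdots R(1) R(0).$$

Step 3. Extend $\psi$ to all of $k\{\mathcal{LT}(E_1, \ldots, E_M)\}$ by $k$-linearity.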

The next three propositions describe fundamental properties of the map $\psi$. Note that the next proposition is an example of simplification by factoring $\chi$ through the set of labeled trees: we will see that often $\psi$ and $\phi$ together are cheaper to compute than $\chi$.

Proposition 4 (i) The map $\psi$ is an algebra homomorphism.

(ii) $\chi = \psi \circ \phi$.

Proof: The proof of (i) is a straightforward verification and is contained in [7]. Since $\chi$ and $\psi \circ \phi$ agree on the generating set $E_1, \ldots, E_M$, part (ii) follows from part (i).
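As a check of part (ii) in the smallest nontrivial case, the following worked example is added for illustration. It assumes $E_j = \sum_\mu a_j^\mu D_\mu$ and reads the definition of $\psi$ as follows: each non-root node contributes its coefficient, differentiated once for each of its children, and the root contributes one derivation for each of its children. The monomial $E_2 E_1$ is sent by $\phi$ to the sum of two trees: the chain $t_a$, in which the node labeled $E_2$ is a child of the node labeled $E_1$, and the tree $t_b$, in which both labeled nodes are children of the root. Then

$$\psi(t_a) = \sum_{\mu_1, \mu_2 = 1}^{N} a_2^{\mu_2} \left( D_{\mu_2} a_1^{\mu_1} \right) D_{\mu_1}, \qquad \psi(t_b) = \sum_{\mu_1, \mu_2 = 1}^{N} a_2^{\mu_2} a_1^{\mu_1} D_{\mu_2} D_{\mu_1},$$

$$\psi(t_a) + \psi(t_b) = \sum_{\mu_1, \mu_2 = 1}^{N} a_2^{\mu_2} D_{\mu_2} \left( a_1^{\mu_1} D_{\mu_1} \right) = \chi(E_2 E_1).$$

The middle equality is the product rule for the derivation $D_{\mu_2}$. For the commutator $E_2 E_1 - E_1 E_2$ the tree $t_b$ occurs with both signs, so the second-order sum is never formed.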


5 The cost of computing 
derivations 

In this section, we briefly review the discussion in [8] on the work required to write an expression composed of noncommuting operators in terms of commuting operators. This will prepare us for the next section, in which we consider the cost of simplifying such an expression given several processors. We make the following assumptions: $p \in k\langle E_1, \ldots, E_M \rangle$ is of the form

$$p = \sum_{i=1}^{l} p_i,$$

where each term $p_i$ is of degree $m$; the cost of a multiplication is one unit and the cost of a differentiation is one unit; the cost of an addition is zero units; and the cost of adding a node to a tree is one unit, so that the cost of building a tree $t \in \mathcal{LT}_m(E_1, \ldots, E_M)$ is $m$ units.

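Proposition 5 (i) $\chi(p)$ contains $l\,m!\,N^m$ terms.

(ii) The cost of computing $\chi(p)$ is $2\,l\,m\,m!\,N^m$.

Proof: Suppose $p_i$ is of the form $E_{\gamma_m} \cdots E_{\gamma_1}$, for some indices $1 \le \gamma_1, \ldots, \gamma_m \le M$. Then $\chi(p_i)$ is equal to

$$\sum_{\mu_m=1}^{N} \cdots \sum_{\mu_1=1}^{N} a_{\gamma_m}^{\mu_m} D_{\mu_m} \cdots a_{\gamma_1}^{\mu_1} D_{\mu_1}.$$

After expansion there are $m!\,N^m$ terms, each of which involves $m$ differentiations and $m$ multiplications.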

Proposition 6 The cost of computing $\phi(p)$ is $l\,m\,m!$.

Proof: A monomial of degree $m$ is sent to the sum of $m!$ labeled trees under the map $\phi$. This follows easily by induction and is contained in [5]. By the assumptions above the cost of constructing a labeled tree with $m$ nodes (in addition to the root) is $m$ units. Therefore the total cost is $l\,m\,m!$.


Proposition 7 Let $\sigma = \phi(p)$, and denote by $|\sigma|$ the number of labeled trees with non-zero coefficients in $\sigma$. Then the cost of computing $\psi(\sigma)$ is $2m|\sigma|N^m$.

Proof: Fix a labeled tree

$$t \in \mathcal{LT}_m(E_1, \ldots, E_M).$$

From the definition of the map $\psi$ we see that the cost of computing $\psi(t)$ is $2mN^m$, and hence the total cost is $2m|\sigma|N^m$.

Combining these three propositions gives

Theorem 8 Under the assumptions above, the cost $\mathrm{Cost}_{\mathrm{NAIVE}}(p)$ of computing

$$\chi(p) = \sum_{i=1}^{l} \chi(p_i)$$

is $2\,l\,m\,m!\,N^m$, while the cost $\mathrm{Cost}_{\mathrm{BETTER}}(p)$ of computing

$$L = \psi \circ \phi(p)$$

is $l\,m\,m! + 2m|\sigma|N^m$.

Theorem 1 now follows.
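The two cost formulas of Theorem 8 are easy to compare numerically. The short Python sketch below simply evaluates them; the function names are assumptions made here. In the usage lines, the values $l = 4$ and $m = 3$ correspond to the third-order example of the introduction, $N = 10$ is an arbitrary choice, and $|\sigma| = 6$ is the number of trees that survive when the four monomials of that example are expanded by hand.

```python
from math import factorial


def cost_naive(l, m, N):
    """Cost of expanding each of the l monomials directly: 2 l m m! N^m."""
    return 2 * l * m * factorial(m) * N**m


def cost_better(l, m, N, sigma_size):
    """Cost of building phi(p), then applying psi to the sigma_size trees
    that remain: l m m! + 2 m |sigma| N^m."""
    return l * m * factorial(m) + 2 * m * sigma_size * N**m


naive = cost_naive(4, 3, 10)        # 2 * 4 * 3 * 6 * 1000
better = cost_better(4, 3, 10, 6)   # 4 * 3 * 6 + 2 * 3 * 6 * 1000
ratio = better / naive
```

With these inputs the naive cost is 144000 units and the tree-based cost is 36072 units, a ratio of roughly one quarter; the tree-building term $l\,m\,m!$ is negligible once $N^m$ is large.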

6 Computing derivations with 
several processors 

In the previous sections, we have shown how 
trees are naturally associated with the symbolic 
computation of higher order derivations. In this 
section, we show how trees also lead to natu- 
ral parallel algorithms for symbolic computation. 
Rather than try to state and prove the sharpest 
results, we are content to state and prove an il- 
lustrative theorem of this type. 

The problem is to rewrite the expression $p \in k\langle E_1, \ldots, E_M \rangle$ in terms of commuting operators when several processors are available. As usual let $\chi(p) \in \mathrm{Diff}(D_1, \ldots, D_N; R)$ denote the resulting linear differential operator. Make the following assumptions:

1. $p \in k\langle E_1, \ldots, E_M \rangle$ is of the form

$$p = \sum_{i=1}^{l} p_i,$$

where each term $p_i$ is of degree $m$.

2. The cost of a multiplication or addition is one unit and the cost of a differentiation is one unit; the cost of adding a node to a tree is one unit, so that the cost of building a tree $t \in \mathcal{LT}_m(E_1, \ldots, E_M)$ is $m + 1$ units.

3. We assume that $p \in k\langle E_1, \ldots, E_M \rangle$ is in its simplest form; in other words, any term $E_{\gamma_m} \cdots E_{\gamma_1}$ appears at most once.

4. We assume that there is one processor available for each labeled tree which arises in the computation.

Notation. Each term $p_i$ in $p \in k\langle E_1, \ldots, E_M \rangle$ is of the form

$$c_i E_{\gamma_m} \cdots E_{\gamma_1}, \quad c_i \in k.$$

LabelIndex is defined to be an index taking values between 1 and $m$. If LabelIndex $= j$, then we denote by LabelIndex($p_i$) the $j$th label $E_{\gamma_j}$ in the term $p_i$ of $p$. In the precomputation, we assign one processor for each rooted labeled tree in $\mathcal{LT}(E_1, \ldots, E_M)$. Each processor $u$ has the following data structures associated to it:

1. for each label $E_j \in \{E_1, \ldots, E_M\}$, a list of processors, denoted ProcessorList($E_j$) or ProcessorList($u$)($E_j$);

2. an array TermCount containing counters such that TermCount($u$)[$i$] gives the number of times that term $p_i$ in the polynomial $p \in k\langle E_1, \ldots, E_M \rangle$ has contributed to the tree $u$;

3. a variable TreeCoefficient($u$), which will be used to store the coefficient of the tree $t$ in $\sigma = \phi(p)$.

We say that the processor $u = u_t$ is active in case $\sum_{i=1}^{l} \mathrm{TermCount}(u)[i] > 0$. In other words, a processor $u = u_t$, where $t \in \mathcal{LT}_k(E_1, \ldots, E_M)$, is active in case its TermCount array has some positive entry.

We begin by describing a precomputation. 

Step 1. We associate a processor $u = u_t$ to each tree in $\mathcal{LT}_k(E_1, \ldots, E_M)$, for $k = 1, \ldots, m$.



(* Step 0 *)
for each processor u do simultaneously
    for i := 1 to l do
        TermCount(u)[i] := 0;
    end;
end;


Step 2. Let $u_t$ be the processor assigned to the tree $t \in \mathcal{LT}_k(E_1, \ldots, E_M)$, for $k < m$, in Step 1, with labels $E_{\gamma_k}, \ldots, E_{\gamma_1}$. Let $E_{\gamma_{k+1}}$ be a label. The tree $t$ yields $k + 1$ trees labeled with $E_{\gamma_{k+1}}, \ldots, E_{\gamma_1}$ which arise by attaching the node labeled $E_{\gamma_{k+1}}$ to the tree $t$ in all possible ways. Since these are labeled trees, they have already been assigned a processor by the step above. Let $u_1, \ldots, u_{k+1}$ denote these processors. In this step, we create the list ProcessorList($E_{\gamma_{k+1}}$, $u$) containing the processors $u_1, \ldots, u_{k+1}$. We do this for each label $E_{\gamma_{k+1}} \in \{E_1, \ldots, E_M\}$.
We give the algorithm to do the parallel com- 
putation of (j> in Figure 1. We make two remarks. 
First, write conflicts are possible in Step 2 of 
the algorithm. Indeed, consider the addition of 
TermCount(u)[i] to TermCount(u')[i] by proces- 
sor u. Suppose that processor u f is associated 
with tree tf . Then the number of possible incre- 
ments of TermCount(V)[i], if v! is associated with 
a tree with k + 1 nodes, is at most k. This is be- 
cause one processor is associated with each tree 
that arises by deleting one leaf from t f . A pro- 
cessor associated with a tree with k nodes will 
access the element TermCount(u)[i] of k other 
processors. Therefore a processor u will need 
to wait at most Im cycles to access the entry 
TermCount(V)[i], and will need to access at most 
m such entries for each i . 

Second, using Brent's algorithms for the parallel computation of arithmetic expressions [1], it is possible to compute $\psi$ in parallel. Let $\sigma = \phi(p)$ and recall that the number of operations to compute $\psi(\sigma)$ is $O(m|\sigma|N^m)$ by Proposition 7. Therefore, given sufficiently many processors, $\psi(\sigma)$ can be computed in time $O(\log_2(m|\sigma|N^m))$.

Proposition 9 The cost of computing $\phi(p)$ according to the algorithm in Figure 1 is $O(l^2 m^3)$.

Proof: Step 0 and Step 3 take time $O(l)$. Step 1 takes time $O(l^2)$. If $t \in \mathcal{LT}_k(E_1, \ldots, E_M)$ and $u = u_t$, then the following estimate holds for the inner loop of Step 2. The outer loop is repeated $m$ times. The next sequential loop is repeated $l$ times. Since the length of ProcessorList is


(* Step 1 *)
LabelIndex := 1;
for i := 1 to l do
    TermCount(u_i)[i] := 1;
end;
(* In Step 1, u_i denotes the tree with two nodes,
   in which the node other than the root is
   labeled with LabelIndex(p_i). *)

(* Step 2 *)
for LabelIndex := 1 to m - 1 do
    for each active processor u = u_t for which
            t has LabelIndex + 1 nodes do simultaneously
        for i := 1 to l do
            for all u' in ProcessorList(LabelIndex(p_i), u) do
                TermCount(u')[i] := TermCount(u')[i]
                                    + TermCount(u)[i];
            end;
        end;
    end;
end;

(* Step 3 *)
for each active processor u = u_t for which
        t has m + 1 nodes do simultaneously
    TreeCoefficient(u) := 0;
    for i := 1 to l do
        TreeCoefficient(u) := TreeCoefficient(u)
                              + c_i * TermCount(u)[i];
    end;
end;

Figure 1: The Parallel Computation of $\phi$.


at most $m$, the next sequential loop is repeated at most $m$ times. By the first remark above, each of the at most $m$ iterations of this loop will need to wait at most $lm$ time units to execute. Therefore the total execution time for Step 2 is bounded by $O(l^2 m^3)$. This completes the proof of the proposition.
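The data flow of Figure 1 can be followed in a small sequential simulation. The Python sketch below is an illustration under stated assumptions, not the parallel algorithm itself: processors are simulated by dictionary keys (one per labeled tree), the ProcessorList of a tree is enumerated on the fly instead of being precomputed, and at stage LabelIndex $= j$ the label attached for term $p_i$ is taken to be its $(j+1)$th label $E_{\gamma_{j+1}}$. The names `leaf_attachments` and `tree_coefficients` are hypothetical.

```python
from collections import defaultdict


def leaf_attachments(tree, label):
    """Yield each tree obtained by adding one new leaf `label` below a node.
    A tree is (label, sorted tuple of children); the root has label None."""
    lab, children = tree
    yield (lab, tuple(sorted(children + ((label, ()),))))
    for i, child in enumerate(children):
        rest = children[:i] + children[i + 1:]
        for new_child in leaf_attachments(child, label):
            yield (lab, tuple(sorted(rest + (new_child,))))


def tree_coefficients(terms):
    """terms: list of (c_i, (g_m, ..., g_1)), all of the same degree m.
    Returns the non-zero TreeCoefficient values, keyed by tree."""
    m = len(terms[0][1])
    # term_count[tree][i] plays the role of TermCount(u_t)[i].
    term_count = defaultdict(lambda: defaultdict(int))
    # Step 1: term i starts at the two-node tree carrying its first label.
    for i, (_, labels) in enumerate(terms):
        term_count[(None, ((labels[-1], ()),))][i] = 1
    # Step 2: push the counts from trees with j + 1 nodes to j + 2 nodes.
    for j in range(1, m):
        nxt = defaultdict(lambda: defaultdict(int))
        for tree, counts in term_count.items():      # active processors
            for i, count in counts.items():
                next_label = terms[i][1][-(j + 1)]    # E_{gamma_{j+1}}
                for bigger in leaf_attachments(tree, next_label):
                    nxt[bigger][i] += count
        term_count = nxt
    # Step 3: TreeCoefficient(u) = sum over i of c_i * TermCount(u)[i].
    coeffs = {}
    for tree, counts in term_count.items():
        value = sum(terms[i][0] * n for i, n in counts.items())
        if value != 0:
            coeffs[tree] = value
    return coeffs


# p = E3 E2 E1 - E3 E1 E2 - E2 E1 E3 + E1 E2 E3
bracket = [(1, ("E3", "E2", "E1")), (-1, ("E3", "E1", "E2")),
           (-1, ("E2", "E1", "E3")), (1, ("E1", "E2", "E3"))]
sigma = tree_coefficients(bracket)
```

For this third-order example, 6 of the 14 distinct trees that arise keep a non-zero coefficient, and in each of them the root has exactly one child, which matches the observation that the resulting operator is first-order.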

Recall that by Proposition 6, $\phi(p)$ can be computed in serial time $O(l\,m\,m!)$. Comparing this to the cost of the algorithm above gives

Theorem 10

$$\frac{\mathrm{Cost}_{\text{serial } \phi\text{-algorithm}}(p)}{\mathrm{Cost}_{\text{parallel } \phi\text{-algorithm}}(p)}$$